Rescorla-Wagner model

Introduction

In this paper, we will study the basics of the Rescorla-Wagner model, a model designed to predict the presence or absence of an outcome (typically a reward) from the presence or absence of other events (stimuli). In this model, we suppose that the animal is continuously predicting the presence or absence of a reward. This differs from the classical Pavlovian and Skinnerian accounts of conditioning, in which the animal learns by association. We will see in this work that in the Rescorla-Wagner model, the animal does not learn by association but by violation of its prediction: something is learned only through the minimization of the prediction error.


In [1]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10, 6)  #  Default figure size for all plots

First, let us create an 'environment' for the model. In this virtual environment, during the first 25 steps both the stimulus u and the reward r are present (1); during the last 25 steps the reward is removed (0) but the stimulus remains present (1):


In [2]:
import matplotlib
from matplotlib import pyplot as plt
import numpy as np

u = []  #  List of stimuli
r = []  #  List of rewards

time = np.arange(0, 50, 1)  #  Set the x axis

for i in range(50):  #  Iterate 50 times
    ui = 1  #  Stimulus i is present
    u.append(ui)
    if i < 25:
        ri = 1  #  Reward i is present
    else:
        ri = 0  #  Reward i is absent
    r.append(ri)

#  Plot
plot = plt.subplot()
plot.plot(time, r, 'ro', label='Reward')
plot.plot(time, u, 'gx', label='Stimulus')
plot.legend(bbox_to_anchor=(1, 0.9))
plt.title('Presence and absence of stimulus and rewards over time.')
plt.xlabel('Time')
plt.ylabel('Presence : 1, Absence : 0')
plt.ylim(-0.04, 1.04)
plt.show()


Now, we will create a variable v which represents the prediction of the reward. If v is close to 0, the model does not predict the reward given the stimulus u; if v is close to 1, the model predicts the presence of the reward. To compute the prediction, we apply the following rule:

$$ v = wu $$

This prediction depends on a weight w, which is the learning parameter. This weight is progressively updated, upwards or downwards, at each step. The model learns by trial and error, and more specifically by minimization of the prediction error $ \delta $. To compute $ \delta $ we apply the following rule:

$$ \delta = r-v $$

The prediction error is the difference between the actual reward and the prediction. This means that if the prediction is close to 0 and the reward is present, the prediction error is positive; conversely, if the prediction is close to 1 and the reward is absent, the prediction error is negative.

To update the weight we need a learning rate $ \epsilon $. With this learning rate we can finally apply the Rescorla-Wagner rule to update w:

$$ w = w + \epsilon \delta u $$

In this equation the varying part is the second term, $\epsilon \delta u$. When $\delta$ is positive, the Rescorla-Wagner rule increases w, which means that the system learns that a reward should be expected in these conditions. Conversely, when $\delta$ is negative, w decreases, which means that the model learns that the reward should not be expected in these conditions.
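
As a minimal sketch (the helper name rw_update below is ours, not part of the original model or notebook), a single trial of this rule can be written as a small function:

def rw_update(w, u, r, epsi):
    #  One Rescorla-Wagner trial: return the prediction, the error and the updated weight
    v = w*u  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w = w+epsi*delta*u  #  Apply the Rescorla-Wagner rule
    return v, delta, w

#  Example: stimulus and reward both present, starting from w = 0
v, delta, w = rw_update(w=0, u=1, r=1, epsi=0.1)
print(v, delta, w)  #  prints 0 1 0.1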

Let us now create such a model with the environment defined earlier, in which both the stimulus and the reward are present during the first 25 steps, but only the stimulus is present during the last 25 steps:


In [3]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w = 0  #  Weight parameter
epsi = 0.1  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions

for i in range(50):
    if i < 25:
        u = 1  #  Stimulus
        r = 1  #  Reward
    else:
        u = 1  #  Stimulus
        r = 0  #  No reward
    v = w*u  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w = w+epsi*delta*u  #  Compute the weight using rescorla-wagner rule
    yv.append(v)

#  Plot
plt.bar(time, yv, width=1)
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.show()


As we can see, during the first 25 steps the model learns that if the stimulus u is present then the reward r is present too, because this is always the case. The model is now conditioned to predict r whenever u is present. But in the last 25 steps the reward disappears while the stimulus is still present, so the model keeps predicting the reward. Since the reward is absent, the prediction error $\delta$ is negative and w decreases. Progressively an extinction occurs: the model unlearns the association between the stimulus u and the presence of the reward r, and learns instead that the presence of u goes with the absence of r.
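
As a side note (a derivation of ours from the update rule, not part of the original text), with u = 1 and r = 1 on every trial the learning curve has a simple closed form. Starting from w = 0, the weight after n paired trials satisfies:

$$ w_{n+1} = w_n + \epsilon (1 - w_n) \;\Rightarrow\; 1 - w_{n+1} = (1-\epsilon)(1 - w_n) \;\Rightarrow\; w_n = 1 - (1-\epsilon)^n $$

During extinction (r = 0, u = 1), the same reasoning gives an exponential decay $ w_{25+m} = w_{25}\,(1-\epsilon)^m $ towards 0.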

The learning rate accelerates or decelerates the learning of the model: the higher $\epsilon$ is, the faster the learning; the smaller $\epsilon$ is, the slower the learning. To show this, let us look at the result of the following code:


In [4]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w = 0  #  Weight parameter
epsis = [0.05, 0.1, 0.2]  #  Learning rates to test
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis used as x axis
yvs = []  #  Nested list of prediction evolutions

for n in range(len(epsis)):
    epsi = epsis[n]
    yv = []
    for i in range(50):
        if i < 25:
            u = 1  #  Stimulus
            r = 1  #  Reward
        else:
            u = 1  #  Stimulus
            r = 0  #  No reward
        v = w*u  #  Compute the prediction
        delta = r-v  #  Compute the prediction error
        w = w+epsi*delta*u  #  Compute the weight using rescorla-wagner rule
        yv.append(v)
    yvs.append(yv)

#  Plot
plot = plt.subplot()
plot.bar(time, yvs[2], label='Epsilon : 0.2', width=1, color='green', alpha=0.25)
plot.bar(time, yvs[1], label='Epsilon : 0.1', width=1, color='blue', alpha=0.25)
plot.bar(time, yvs[0], label='Epsilon : 0.05', width=1, color='red', alpha=0.25)
plot.legend(bbox_to_anchor=(1, 0.9))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.show()


The reason for this variation of learning speed with $\epsilon$ is found in the second term of the Rescorla-Wagner rule. The parameter $\epsilon$ acts as a weight on the $\delta u$ part: the higher $\epsilon$ is, the larger the varying term $\epsilon \delta u$ and the faster w adapts to the environment; the lower $\epsilon$ is, the smaller this term and the slower w adapts. Extinction is affected by the learning rate in the same way, since extinction is simply characterized by a negative $\epsilon \delta u$.
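
As a quick check (a sketch of ours, not a cell from the original notebook), the simulated acquisition curves can be compared with the closed form $1 - (1-\epsilon)^n$ given above:

for epsi in [0.05, 0.1, 0.2]:
    w = 0
    for n in range(25):  #  25 paired trials (u = 1, r = 1)
        w = w + epsi*(1 - w)  #  Rescorla-Wagner update with u = 1
    closed_form = 1 - (1 - epsi)**25
    print(epsi, round(w, 4), round(closed_form, 4))  #  The two values coincide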

Partial conditioning

Now let us look at partial conditioning, that is, conditioning with random occurrences of the reward. Here the reward does not depend on time: it appears randomly on each trial with a probability of 0.4:


In [5]:
import random
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w = 0  #  Weight parameter
epsi = 0.1  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions
yr = []  #  List of rewards

for i in range(50):
    u = 1  #  Stimulus
    rand = random.random()
    if rand <= 0.4:
        r = 1  #  Reward
    else:
        r = 0  #  No reward
    v = w*u  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w = w+epsi*delta*u  #  Compute the weight using rescorla-wagner rule
    yv.append(v)
    yr.append(r)

#  Plot
plot = plt.subplot()
plot.bar(time, yv, label='Prediction', width=1, color='blue')
plot.plot(time, yr, 'ro', label='Reward')
plot.legend(bbox_to_anchor=(1, 0.92))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(-0.04, 1.04)
plt.show()


With a reward occurring with a 40% probability, the model still learns, but its prediction varies randomly around 0.4. Because the occurrence of the reward is random and the model always learns from its error, the prediction cannot completely stabilize at 0.4 but keeps oscillating around this value. To show that it is indeed around 0.4 that the prediction oscillates, let us plot the same simulation, but over 2000 steps and together with the evolution of the running average of all predictions.


In [6]:
import random
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w = 0  #  Weight parameter
epsi = 0.1  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 2000, 1)  #  Time axis
yv = []  #  List of predictions
yaverage = []  #  List of predictions averages

for i in range(2000):
    u = 1  #  Stimulus
    rand = random.random()  #  Run random
    if rand <= 0.4:
        r = 1  #  Reward
    else:
        r = 0  #  No reward
    v = w*u  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w = w+epsi*delta*u  #  Compute the weight using rescorla-wagner rule
    yv.append(v)
    average = np.mean(yv)  #  Compute the average of predictions
    yaverage.append(average)

#  Plot
plot = plt.subplot()
plot.plot(time, yv, '-b', label='Prediction', linewidth=2)
plot.plot(time, yaverage, '-r', label='Prediction average', linewidth=2)
plot.legend(bbox_to_anchor=(1, 0.92))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(0, 1)
plt.show()


The average of the predictions first increases because the model has not yet learned; then it saturates around 0.4. With u = 1, the weight w is an exponentially weighted average of the past rewards, so its expected value converges to the reward probability. In line with the law of large numbers, which tells us that the average of an event with probability x tends to x as the number of trials increases, the running average of the predictions settles at 0.4.
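
To make the 'exponentially weighted average' reading concrete (a small check of ours, not a cell from the original notebook), the weight produced by the update rule can be compared with the explicit weighted sum of past rewards $ w_n = \epsilon \sum_{k<n} (1-\epsilon)^{n-1-k} r_k $:

import random

random.seed(0)
epsi = 0.1
rewards = [1 if random.random() <= 0.4 else 0 for _ in range(200)]

#  Weight obtained by applying the Rescorla-Wagner rule (u = 1 throughout)
w = 0
for r in rewards:
    w = w + epsi*(r - w)

#  The same quantity written as an exponentially weighted average of past rewards
n = len(rewards)
w_explicit = sum(epsi*(1 - epsi)**(n - 1 - k)*r for k, r in enumerate(rewards))

print(round(w, 10), round(w_explicit, 10))  #  Identical up to floating-point rounding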

Blocking

The blocking phenomenon is an empirical phenomenon found in ethology. It happens when we first condition a stimulus u1 and then add another stimulus u2 while keeping u1. In this case, the animal does not learn that u2 is correlated with r, because it already predicts that r will be present from the stimulus u1. As a consequence, $\delta$ is small or equal to 0, and so is the learning. In other words: its prediction not being violated, the animal does not learn. Let us plot the phenomenon:


In [7]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w1 = 0  #  Weight 1
w2 = 0  #  Weight 2
epsi = 0.1  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions
yu1 = []
yu2 = []
yw1 = []
yw2 = []

for i in range(50):
    if i < 25:
        u1 = 1  #  Stimulus 1
        u2 = 0  #  No stimulus 2
        r = 1  #  Reward
    else:
        u1 = 1  #  Stimulus 1
        u2 = 1  #  Stimulus 2
        r = 1  #  Reward
    v = w1*u1+w2*u2  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w1 = w1+epsi*delta*u1  #  Compute Rescorla-Wagner rule for w1
    w2 = w2+epsi*delta*u2  #  Compute Rescorla-Wagner rule for w2
    yv.append(v)
    yu1.append(u1)
    yu2.append(u2)
    yw1.append(w1)
    yw2.append(w2)

plot = plt.subplot()
plot.bar(time, yv, label='v', width=1, color='blue')
plot.plot(time, yw1, '-g', label='w1', linewidth=2)
plot.plot(time, yw2, '-r', label='w2', linewidth=2)
plot.plot(time, yu1, 'xg', label='u1')
plot.plot(time, yu2, '+r', label='u2')
plot.legend(bbox_to_anchor=(1, 0.5))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(-0.04, 1.04)
plt.show()


To see which stimulus contributes to the prediction of the reward, we plot the weights of both. We can see that when u2 starts to occur, w2 increases only slightly. This happens because the prediction based on u1 has not yet reached 1, so there is still room for a positive $\delta$, which means that w1 and w2 both increase until the prediction equals 1. We can explore this partial blocking by repeating the previous code with different values of epsilon.
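
A short side calculation (ours, derived from the update rule, not part of the original text) quantifies this partial blocking: in the second phase both weights receive exactly the same update $\epsilon \delta$, so their difference stays fixed at the value $v_{25}$ that w1 reached during the first phase, while their sum converges to 1. Hence

$$ w_1 \rightarrow \frac{1 + v_{25}}{2}, \qquad w_2 \rightarrow \frac{1 - v_{25}}{2} $$

so the closer $v_{25}$ is to 1 at the end of the first phase, the less u2 can be learned.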

With $\epsilon$ = 0.2:


In [8]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w1 = 0  #  Weight 1
w2 = 0  #  Weight 2
epsi = 0.2  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions
yu1 = []
yu2 = []
yw1 = []
yw2 = []

for i in range(50):
    if i < 25:
        u1 = 1  #  Stimulus 1
        u2 = 0  #  No stimulus 2
        r = 1  #  Reward
    else:
        u1 = 1  #  Stimulus 1
        u2 = 1  #  Stimulus 2
        r = 1  #  Reward
    v = w1*u1+w2*u2  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w1 = w1+epsi*delta*u1  #  Compute Rescorla-Wagner rule for w1
    w2 = w2+epsi*delta*u2  #  Compute Rescorla-Wagner rule for w2
    yv.append(v)
    yu1.append(u1)
    yu2.append(u2)
    yw1.append(w1)
    yw2.append(w2)

plot = plt.subplot()
plot.bar(time, yv, label='v', width=1, color='blue')
plot.plot(time, yw1, '-g', label='w1', linewidth=2)
plot.plot(time, yw2, '-r', label='w2', linewidth=2)
plot.plot(time, yu1, 'xg', label='u1')
plot.plot(time, yu2, '+r', label='u2')
plot.legend(bbox_to_anchor=(1, 0.5))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(-0.04, 1.04)
plt.show()


In this simulation, because of the higher learning rate, w1 increases until v is essentially 1 before the 25th step (about 0.996 after 25 trials). There is then almost no room left for a prediction error, so $\epsilon \delta u \approx 0$ and u2 is not learned as a correlate of r. We are here in the presence of an essentially complete blocking phenomenon.

With $\epsilon$ = 0.05:


In [9]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w1 = 0  #  Weight 1
w2 = 0  #  Weight 2
epsi = 0.05  #  Learning rate
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions
yu1 = []
yu2 = []
yw1 = []
yw2 = []

for i in range(50):
    if i < 25:
        u1 = 1  #  Stimulus 1
        u2 = 0  #  No stimulus 2
        r = 1  #  Reward
    else:
        u1 = 1  #  Stimulus 1
        u2 = 1  #  Stimulus 2
        r = 1  #  Reward
    v = w1*u1+w2*u2  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w1 = w1+epsi*delta*u1  #  Compute Rescorla-Wagner rule for w1
    w2 = w2+epsi*delta*u2  #  Compute Rescorla-Wagner rule for w2
    yv.append(v)
    yu1.append(u1)
    yu2.append(u2)
    yw1.append(w1)
    yw2.append(w2)

plot = plt.subplot()
plot.bar(time, yv, label='v', width=1, color='blue')
plot.plot(time, yw1, '-g', label='w1', linewidth=2)
plot.plot(time, yw2, '-r', label='w2', linewidth=2)
plot.plot(time, yu1, 'xg', label='u1')
plot.plot(time, yu2, '+r', label='u2')
plot.legend(bbox_to_anchor=(1, 0.5))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(-0.04, 1.04)
plt.show()


Here, because of the small $\epsilon$, the prediction is not equal to 1 at the 25th step (only about 0.72 after 25 trials). The prediction error is therefore still positive and the Rescorla-Wagner rule keeps modifying w1 and w2. In this case, as in the first simulation, u2 is partially learned as a co-occurring correlate of the reward.

The blocking phenomenon is important because it is used as a counter-example to the classical Pavlovian and Skinnerian models of conditioning. Those models imply that the second stimulus would be learned by association as a co-occurring event of the reward, but empirically animals do not learn it in the conditions in which blocking occurs.

Overshadowing

Now we will test the overshadowing phenomenon. In this setup, both stimuli are present on every trial, but w1 is updated with $\epsilon_1 = 0.2$ and w2 with $\epsilon_2 = 0.1$.


In [10]:
from matplotlib import pyplot as plt
import numpy as np

v = 0  #  Prediction
w1 = 0  #  Weight 1
w2 = 0  #  Weight 2
epsi1 = 0.2  #  Learning rate 1
epsi2 = 0.1  #  Learning rate 2
delta = 0  #  Prediction error

time = np.arange(0, 50, 1)  #  Time axis
yv = []  #  List of predictions
yw1 = []
yw2 = []

for i in range(50):
    u1 = 1  #  Stimulus 1
    u2 = 1  #  Stimulus 2
    r = 1  #  Reward
    v = w1*u1+w2*u2  #  Compute the prediction
    delta = r-v  #  Compute the prediction error
    w1 = w1+epsi1*delta*u1  #  Compute Rescorla-Wagner rule for w1
    w2 = w2+epsi2*delta*u2  #  Compute Rescorla-Wagner rule for w2
    yv.append(v)
    yw1.append(w1)
    yw2.append(w2)

plot = plt.subplot()
plot.bar(time, yv, label='v', width=1, color='blue')
plot.plot(time, yw1, '-g', label='w1', linewidth=2)
plot.plot(time, yw2, '-r', label='w2', linewidth=2)
plot.legend(bbox_to_anchor=(1, 0.5))
plt.title('Prediction of reward over time')
plt.xlabel('Time')
plt.ylabel('Prediction (v)')
plt.ylim(-0.04, 1.04)
plt.show()


Both stimuli are learned as co-occurring with the reward. But because w1 is updated with a higher learning rate than w2, w1 increases faster than w2. This means that the animal's prediction of the reward is determined mostly by stimulus 1: stimulus 1 overshadows stimulus 2.
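
A short side calculation (ours, not part of the original text) shows where the two weights end up: since both start at 0 and receive updates proportional to their learning rates applied to the same $\delta$, the ratio w1/w2 stays equal to $\epsilon_1/\epsilon_2$ on every trial, while their sum converges to 1. Hence

$$ w_1 \rightarrow \frac{\epsilon_1}{\epsilon_1 + \epsilon_2} = \frac{2}{3}, \qquad w_2 \rightarrow \frac{\epsilon_2}{\epsilon_1 + \epsilon_2} = \frac{1}{3} $$

which matches the final heights of the w1 and w2 curves in the plot above.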

